2. NLP Analysis¶

Speaker analysis¶

  • Who are the most important characters?
  • How does the importance of the characters change over time?
  • Can we detect key moments in the show based on speech portions?

Preprocessing for Speaker Analysis¶

  • Extract directorials (and remove deleted scenes + respective column):
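A minimal sketch of how the directorial extraction could work, assuming stage directions appear in square brackets inside `line_text` (the helper name and column layout here are illustrative, not the notebook's actual `preprocess` internals):

```python
import re
import pandas as pd

def split_directorials(line: str):
    """Split a script line into spoken text and bracketed stage directions."""
    directorials = re.findall(r"\[(.*?)\]", line)   # capture the [ ... ] contents
    spoken = re.sub(r"\[.*?\]", "", line).strip()   # remove them from the line
    return spoken, "; ".join(directorials) if directorials else None

df = pd.DataFrame(
    {"line_text": ["All right Jim. [points at camera] Your quarterlies look very good."]}
)
df[["line_text", "directorials"]] = df["line_text"].apply(
    lambda s: pd.Series(split_directorials(s))
)
```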
In [7]:
fig = px.bar(lines_per_character.sort_values("season"), x="speaker", y="line_text", color='season', color_discrete_sequence=px.colors.qualitative.Prism, title='Lines per character')
fig.update_xaxes(categoryorder='array', categoryarray= top20_characters)
fig.update_yaxes(title='number of lines')
In [10]:
fig = px.bar(words_per_character.sort_values("season"), x="speaker", y="word_count", color='season', color_discrete_sequence=px.colors.qualitative.Prism, title='Words per character')
fig.update_xaxes(categoryorder='array', categoryarray= top20_characters)
fig.update_yaxes(title='number of words')
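The aggregates plotted above (`lines_per_character`, `words_per_character`, `top20_characters`) are not constructed in the excerpt; they can be derived from the line-level frame roughly as follows (a sketch with made-up data; column names follow the notebook):

```python
import pandas as pd

df = pd.DataFrame({
    "speaker": ["Michael", "Jim", "Michael"],
    "season": [1, 1, 2],
    "line_text": ["All right Jim.", "Oh, I told you.", "So you've come to the master."],
})

# lines per character and season: count rows per (speaker, season)
lines_per_character = (
    df.groupby(["speaker", "season"])["line_text"].count().reset_index()
)

# words per character and season: sum whitespace-split token counts
df["word_count"] = df["line_text"].str.split().str.len()
words_per_character = (
    df.groupby(["speaker", "season"])["word_count"].sum().reset_index()
)

# x-axis category order: speakers with the most lines first
top20_characters = df["speaker"].value_counts().head(20).index.tolist()
```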

Development over time¶

In [12]:
fig.show()
  • certain events in the show can be identified
  • Michael clearly dominates in terms of speech portions and is (after season 7) "replaced" jointly by Dwight, Jim, Pam and especially Andy

Word Analysis¶

  • Primarily: extracting frequent or important words from the lines
  • Can we identify words that are specifically relevant (in certain parts of the show)?
  • Can we derive information on the speaking style in the show from the words used?

Preprocessing¶

  • extract directorials
  • remove punctuation
  • lowercase
  • expand contractions
  • tokenize lines: TreeBankWord + tokenize special self-defined tokens: character names (e.g. "Michael Scott") and compound words (e.g. "Dunder Mifflin")
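The self-defined multi-word tokens can be protected on top of the Treebank tokenizer, e.g. with NLTK's `MWETokenizer` (a sketch; the actual name and compound lists are loaded from the CSV/TXT files passed to `preprocess` below):

```python
from nltk.tokenize import MWETokenizer, TreebankWordTokenizer

# multi-word expressions to keep as single tokens (illustrative subset)
mwes = [("michael", "scott"), ("dunder", "mifflin")]
mwe_tokenizer = MWETokenizer(mwes, separator=" ")

line = "michael scott runs dunder mifflin scranton"
tokens = TreebankWordTokenizer().tokenize(line)
tokens = mwe_tokenizer.tokenize(tokens)
# -> ['michael scott', 'runs', 'dunder mifflin', 'scranton']
```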
In [13]:
param_dict_tokens = {
    "concat_scenes": False,
    "extract_direc": True, 
    "remove_punct": True, 
    "rmv_stopwords": False,
    "lwr": True, 
    "exp_contractions": True,
    "conversion": "tokenize",
    "tokenizer": ("TreeBankWord", True, PATH+"character_names.csv", PATH+"compound_words_the-office_by_chatgpt.txt")
}

df_tokens = preprocess(df_raw, **param_dict_tokens)
df_tokens.head()
Out[13]:
season episode scene line_text speaker season_episode directorials
id
1 1 1 1 [all, right, jim, your, quarterlies, look, ver... Michael 101 NaN
2 1 1 1 [oh, i, told, you, i, could, not, close, it, so] Jim 101 NaN
3 1 1 1 [so, you, have, come, to, the, master, for, gu... Michael 101 NaN
4 1 1 1 [actually, you, called, me, in, here, but, yeah] Jim 101 NaN
5 1 1 1 [all, right, well, let, me, show, you, how, it... Michael 101 NaN

Word Cloud¶

In [14]:
all_words = [item for sublist in df_tokens["line_text"].tolist() for item in sublist]
all_words_freq = nltk.FreqDist(all_words)
df_all_words_freq = pd.Series(dict(all_words_freq)).sort_values(ascending=False)

wordcloud = WordCloud(width=800, height=300, background_color="white", max_words=100, contour_width=3, contour_color='steelblue').generate(" ".join(all_words))
wordcloud.to_image()
Out[14]:
  • mostly words that are common in everyday language use
  • also some names of characters
  • topic-related words are rare (e.g. job, office, work, call, party, friend, love)

Important Words¶

Tagging¶

  • tag words with part-of-speech (POS) tags to identify the most common words by lexical category (especially nouns)
  1. The Brill tagger was used, but did not yield sufficient results.
  2. A standard POS tagger as second approach:
In [19]:
param_dict_tokens_nostopwords = {
    "concat_scenes": False,
    "extract_direc": True, 
    "remove_punct": False, 
    "rmv_stopwords": False,
    "lwr": True, 
    "exp_contractions": False,
    "conversion": "pos_tag"
}
df_tokens_tagged = preprocess(df_raw, **param_dict_tokens_nostopwords)
df_tokens_tagged.head()
Out[19]:
season episode scene line_text speaker season_episode directorials
id
1 1 1 1 [(all, DT), (right, JJ), (jim., NN), (your, PR... Michael 101 NaN
2 1 1 1 [(oh, UH), (,, ,), (i, JJ), (told, VBD), (you.... Jim 101 NaN
3 1 1 1 [(so, RB), (you, PRP), ('ve, VBP), (come, VBN)... Michael 101 NaN
4 1 1 1 [(actually, RB), (,, ,), (you, PRP), (called, ... Jim 101 NaN
5 1 1 1 [(all, DT), (right., NN), (well, RB), (,, ,), ... Michael 101 NaN
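The variable `all_words_tagged_filtered_jj` used below is not defined in the excerpt; a plausible construction flattens the tagged lines and keeps only adjective tags (`JJ`, `JJR`, `JJS`):

```python
# stand-in for df_tokens_tagged["line_text"]: one list of (word, POS) pairs per line
tagged_lines = [
    [("all", "DT"), ("right", "JJ"), ("jim", "NN")],
    [("so", "RB"), ("good", "JJ"), ("thing", "NN")],
]

# flatten and keep adjectives only
all_words_tagged_filtered_jj = [
    word for line in tagged_lines for word, tag in line if tag.startswith("JJ")
]
# -> ['right', 'good']
```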
In [22]:
all_tagged_freq = nltk.FreqDist(all_words_tagged_filtered_jj)
df_all_tagged_freq = pd.Series(dict(all_tagged_freq)).sort_values(ascending=False).drop('i')

fig2 = px.bar(y=df_all_tagged_freq[:10].index, x=df_all_tagged_freq[:10].values, orientation='h', title='Most common Adjectives', height=450)
fig2.update_traces(width=0.5)
fig2.show()

TF-IDF¶

TF-IDF scores are used to determine important words in the dataset.

In [26]:
features_tfidf_agg[0:10]
Out[26]:
you        0.072911
michael    0.059165
is         0.056919
to         0.054235
the        0.052029
it         0.047500
dwight     0.045659
jim        0.045000
that       0.040082
pam        0.039464
dtype: float64

Lexical Dispersion Plot¶

In [28]:
target_words = ['scranton', 'stamford', 'philly', 'dundie', 'boat']
fig = plt.figure(figsize=(10,4))
visualizer = DispersionPlot(target_words, ax=fig.add_subplot(111))
visualizer.fit([all_words])
visualizer.show();

Word Analysis by Speaker: Michael vs. Dwight¶

Most common words: Michael vs. Dwight¶

In [33]:
fig.show()

3-grams: Michael vs. Dwight¶

In [35]:
df_ngrams_michael_dwight
Out[35]:
ngram Michael ngram Dwight
0 ((let, us, go), 75) ((let, us, go), 38)
1 ((let, us, get), 35) ((hey, hey, hey), 21)
2 ((hey, hey, hey), 26) ((yes, yes, yes), 17)
3 ((come, let, us), 23) ((let, us, get), 15)
4 ((oh, god, oh), 21) ((ha, ha, ha), 13)
5 ((beep, beep, beep), 21) ((go, go, go), 13)
6 ((let, us, see), 18) ((jim, jim, jim), 12)
7 ((god, oh, god), 17) ((wait, wait, wait), 10)
8 ((stop, stop, stop), 16) ((whoa, whoa, whoa), 9)
9 ((na, na, na), 16) ((la, la, la), 9)
10 ((go, let, us), 13) ((let, us, see), 7)
11 ((yeah, yeah, yeah), 13) ((michael, michael, michael), 7)
12 ((right, right, right), 13) ((zero, zero, zero), 7)
13 ((blah, blah, blah), 13) ((volunteer, sheriffs, deputy), 6)
14 ((go, go, go), 13) ((one, two, three), 6)
  • the most frequent 3-grams support that everyday spoken language is used in the show (colloquial language/slang); note that "let us" reflects the expanded contraction "let's"
  • characteristic for Michael: "oh, god, oh" or nonsense like "beep, beep, beep"
  • characteristic for Dwight: names like "jim, jim, jim" or "michael, michael, michael", and also "volunteer, sheriffs, deputy"
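The per-speaker 3-gram counts above can be reproduced with `nltk.ngrams` and a `Counter` over the token lists from `df_tokens` (a sketch with made-up token lines):

```python
from collections import Counter

from nltk import ngrams

# stand-in for one speaker's token lists from df_tokens
token_lines = [
    ["let", "us", "go", "let", "us", "go"],
    ["hey", "hey", "hey", "hey"],
]

counts = Counter()
for tokens in token_lines:
    # counting per line so that ngrams are not formed across line boundaries
    counts.update(ngrams(tokens, 3))

top_ngrams = counts.most_common(15)
```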

Sentiment Analysis¶

  • Which sentiments dominate in the show and how do they develop over time?
  • Can the sentiments describe the development of a relationship in the show?

Sentiments: negative, neutral, positive¶

First sentence of the show¶

In [35]:
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("All right Jim. Your quarterlies look very good. How are things at the library?")
Out[35]:
{'neg': 0.0, 'neu': 0.803, 'pos': 0.197, 'compound': 0.4927}

Sentiment development in a certain episode¶

Episode "Goodbye Michael"¶
In [39]:
# display sentiment over time for a given season and episode
fig = px.line(df_rolling[(df_rolling["season"] == 7) & (df_rolling['episode'] >= 21) & (df_rolling['episode'] < 22)],  x="id", y=["neg", "neu", "pos"], title="Sentiment over time", color_discrete_sequence=['rgb(213,94,0)', 'rgb(240,228,66)', 'rgb(0,158,115)'], height=300)
fig.show()
  • rapid changes between positive and negative sentiment: an emotional episode
  • at the end, when Michael leaves, the sentiment is very positive

Sentiment Analysis with different emotions¶

In [148]:
pd.DataFrame(emotion_analysis)[["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]].rolling(20).mean().plot(cmap="Dark2", figsize=(12,4))
plt.xlabel("lines")
plt.ylabel("degree")
plt.title("Emotions throughout the series: Jim & Dwight", fontsize=18)
plt.show()
  • the relationship between Jim & Dwight becomes less neutral over the course of the show
  • consistent with the storyline, in which they develop a special kind of "friendship" towards the end
  • note that this is a moving average, i.e. peaks may be flattened here
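`emotion_analysis` is not constructed in the excerpt; presumably it is a list of per-line score dicts from an emotion classifier covering the seven labels plotted (e.g. a fine-tuned transformer). The rolling-mean smoothing then reduces to the following (the scores below are invented placeholders, not classifier output):

```python
import pandas as pd

# hypothetical classifier output: one score dict per line (values made up)
emotion_analysis = [
    {"anger": 0.1, "disgust": 0.0, "fear": 0.0, "joy": 0.2,
     "neutral": 0.6, "sadness": 0.05, "surprise": 0.05},
] * 25

emotions = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

# moving average over a 20-line window, as in the plots above
smoothed = pd.DataFrame(emotion_analysis)[emotions].rolling(20).mean()
```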
In [146]:
pd.DataFrame(emotion_analysis)[["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]].rolling(20).mean().plot(cmap="Dark2", figsize=(12,4))
plt.xlabel("lines")
plt.ylabel("degree")
plt.title("Emotions throughout the series: Jim & Pam", fontsize=18)
plt.legend(frameon=True, framealpha=.8)
plt.show()
  • generally: an emotional rollercoaster
  • line 101: Jim asks Pam for a date
  • individual events cannot be pinpointed exactly in the plot due to the moving average

Topic Modeling¶

  • What topics do people talk about in "The Office"?
  • How does this change over time?

Approaches:

  • LDA (Latent Dirichlet Allocation) with preprocessing via TF-IDF scores and CountVectorizer: the results for 10 topics were not sufficient
  • New approach: BERTopic (embeddings with SBERT, dimensionality reduction with UMAP, clustering with HDBSCAN)
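The three BERTopic stages can be wired together explicitly; a configuration sketch of how `topic_model` might be constructed (all parameter values here are assumptions, not the notebook's actual settings):

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

# SBERT sentence embeddings -> UMAP dimensionality reduction -> HDBSCAN clustering
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)

topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",  # SBERT model name (assumed)
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
```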
In [101]:
topics, probs = topic_model.fit_transform(data_unp) 

# show the results (more than 900 topics)
topic_model.visualize_topics()

Topic Reduction¶

In [107]:
topic_model.reduce_topics(data_unp, nr_topics=30)

topic_model.visualize_topics()
In [138]:
topic_model.visualize_barchart([1,5,16,21], width=225)
  • topic 5: birthday- and Christmas-related, but also "and"
    (topic reduction also introduces impurity)

Topics over Time¶

In [131]:
topic_model.visualize_topics_over_time(topics_over_time, topics=[5,7,14], width=960, height=400)
  • topics over time correspond to certain events in the show

Network Analysis¶

  • Which people in "The Office" interact with each other?
  • How does this change over time?